Last week we trained a model for churn. How do we know if it's good?
The fourth week of Machine Learning Zoomcamp is about different metrics to evaluate a binary classifier. These measures include accuracy, the confusion table, precision, recall, ROC curves (TPR, FPR, random model, and ideal model), AUROC, and cross-validation.
Metric - function that compares the predictions with the actual values and outputs a single number that tells how good the predictions are
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import urllib.request
url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'
filename = 'data-week-3.csv'
df = pd.read_csv(url)
df.columns = df.columns.str.lower().str.replace(' ', '_')
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
df.churn = (df.churn == 'yes').astype(int)
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values
del df_train['churn']
del df_val['churn']
del df_test['churn']
numerical = ['tenure', 'monthlycharges', 'totalcharges']
categorical = [
'gender',
'seniorcitizen',
'partner',
'dependents',
'phoneservice',
'multiplelines',
'internetservice',
'onlinesecurity',
'onlinebackup',
'deviceprotection',
'techsupport',
'streamingtv',
'streamingmovies',
'contract',
'paperlessbilling',
'paymentmethod',
]
dv = DictVectorizer(sparse=False)
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)
y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()
0.8034066713981547
Accuracy measures the fraction of correct predictions: the number of correct predictions divided by the total number of predictions.
The decision threshold does not always have to be 0.5, and we can change it. In this particular problem, however, the cutoff with the highest accuracy (80%) was indeed 0.5.
Note that if we build a dummy model with a decision cutoff of 1, so that it predicts that no clients will churn, the accuracy would be 73%. The improvement of the original model over the dummy model is therefore smaller than we might expect.
Therefore, in this problem accuracy alone cannot tell us how good the model is, because the dataset is imbalanced: there are many more instances of one class than of the other. This is known as class imbalance.
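To make the dummy-model argument concrete, here is a minimal sketch on made-up labels (not the Telco data; `y_true_toy` is a hypothetical array). It shows that a model that always predicts "no churn" scores exactly the share of the majority class.

```python
import numpy as np

# Hypothetical labels: 3 churners out of 10 customers (an assumption for illustration).
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# A dummy "model" that predicts no churn for every customer.
y_dummy = np.zeros_like(y_true_toy)

# Accuracy of the dummy model and the share of the majority (non-churn) class.
dummy_accuracy = (y_true_toy == y_dummy).mean()
majority_share = 1 - y_true_toy.mean()

print(dummy_accuracy, majority_share)
```

Any real model on such data has to be judged against this baseline, not against 0%.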
Classes and methods:
np.linspace(x, y, z) - returns a numpy array of z evenly spaced values from x to y (both endpoints included)
Counter(x) - collections class that counts how many times each distinct value occurs in x
accuracy_score(x, y) - sklearn.metrics function for calculating the accuracy of a model, given a target x dataset and a predicted y dataset
len(y_val)
1409
(y_val == churn_decision).sum() / len(y_val)
0.8034066713981547
So we have the accuracy of our base model using 0.5 as the cutoff for churn, no churn. Now we will vary the 0.5 to different numbers to see if the accuracy of our model is better or not.
The np.linspace() method can be used to create an array. In this case we want the values in the array to be from 0 to 1 and we want 21 of them. That will start at 0 and increment by .05 each step.
thresholds = np.linspace(0, 1, 21)
thresholds
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
scores = []
for t in thresholds:
    churn_decision = (y_pred >= t)
    score = (y_val == churn_decision).mean()
    print('%.2f %.3f' % (t, score))
    scores.append(score)
0.00 0.274
0.05 0.510
0.10 0.591
0.15 0.666
0.20 0.710
0.25 0.739
0.30 0.760
0.35 0.772
0.40 0.785
0.45 0.793
0.50 0.803
0.55 0.801
0.60 0.796
0.65 0.786
0.70 0.766
0.75 0.744
0.80 0.734
0.85 0.726
0.90 0.726
0.95 0.726
1.00 0.726
plt.plot(thresholds, scores)
[<matplotlib.lines.Line2D at 0x29303c8cd30>]
We can see that 0.5 is the best one. We used our own function for calculating accuracy (y_val == churn_decision).mean(). sklearn has a function for accuracy, accuracy_score. Let's implement that.
from sklearn.metrics import accuracy_score
accuracy_score(y_val, y_pred >= 0.5)
0.8034066713981547
Let's add that into our previous function to calculate accuracy with the thresholds array.
scores = []
for t in thresholds:
    score = accuracy_score(y_val, y_pred >= t)
    print('%.2f %.3f' % (t, score))
    scores.append(score)
0.00 0.274
0.05 0.510
0.10 0.591
0.15 0.666
0.20 0.710
0.25 0.739
0.30 0.760
0.35 0.772
0.40 0.785
0.45 0.793
0.50 0.803
0.55 0.801
0.60 0.796
0.65 0.786
0.70 0.766
0.75 0.744
0.80 0.734
0.85 0.726
0.90 0.726
0.95 0.726
1.00 0.726
plt.plot(thresholds, scores)
[<matplotlib.lines.Line2D at 0x29303cd7730>]
At the threshold of 1.0 the accuracy is 72.6%. At that threshold the model predicts that nobody churns, so this accuracy should equal the share of customers in the validation set who do not churn. Let's verify that using the Counter class, which counts how many times each value (here True or False) occurs.
from collections import Counter
Counter(y_pred >= 1.0)
Counter({False: 1409})
As we can see, the number of False records is 1409, which is the size of our array, so none of the predictions are greater than or equal to 1.0. Below, a small calculation confirms that the 72.6% is simply the non-churn rate of our validation dataset.
1 - y_val.mean()
0.7260468417317246
So the actual non-churn rate is 72.6% for the validation dataset, and our model's accuracy of 80.3% is only about 7.7 percentage points better. Accuracy therefore isn't the best choice for scoring our model. This is because we have what is called class imbalance: 72.6% of customers don't churn compared to 27.4% of customers who do.
y_val.mean()
0.2739531582682754
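A confusion table is a way to measure the different types of errors and correct decisions that a binary classifier can make. With this information, it is possible to evaluate the quality of the model using different strategies.
If we predict the probability of churning for a customer, we have the following scenarios:
No churn predicted, and the customer did not churn: True Negative (TN)
No churn predicted, but the customer churned: False Negative (FN)
Churn predicted, and the customer churned: True Positive (TP)
Churn predicted, but the customer did not churn: False Positive (FP)
The confusion table helps us summarize the measures explained above in a tabular format, as shown below: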
| Actual/Predictions | Negative | Positive |
|---|---|---|
| Negative | TN | FP |
| Positive | FN | TP |
The accuracy corresponds to the sum of TN and TP divided by the total of observations.
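As a compact illustration of the table layout above, here is a minimal sketch on made-up labels (not the Telco data). It uses `np.bincount` as an alternative way to count the four cells; the names `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np

# Hypothetical toy labels and hard predictions (assumptions for illustration).
y_true_toy = np.array([0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 1, 0, 1, 0])

# Encode each (actual, predicted) pair as 2 * actual + predicted:
# 0 -> TN, 1 -> FP, 2 -> FN, 3 -> TP, then count how often each code occurs.
cells = np.bincount(2 * y_true_toy + y_pred_toy, minlength=4)
table = cells.reshape(2, 2)  # rows: actual class, columns: predicted class

tn, fp, fn, tp = cells
accuracy = (tn + tp) / cells.sum()

print(table)
print(accuracy)
```

The reshaped array follows the same row and column order as the table: TN and FP in the first row, FN and TP in the second.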
actual_positive = (y_val == 1)
actual_negative = (y_val == 0)
actual_positive, actual_negative
(array([False, False, False, ..., False, True, True]), array([ True, True, True, ..., True, False, False]))
t = 0.5
predict_positive = (y_pred >= t)
predict_negative = (y_pred < t)
predict_positive, predict_negative
(array([False, False, False, ..., False, True, True]), array([ True, True, True, ..., True, False, False]))
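In effect, what we are setting up is a logical AND operation. The output is True only when both inputs are True; if either input is False (including when both are False), the output is False. So predict_positive & actual_positive marks exactly the customers who were predicted to churn and actually churned.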
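A minimal sketch of the element-wise AND on boolean numpy arrays, covering all four input combinations. The arrays `predicted` and `actual` are hypothetical and only serve as a truth table.

```python
import numpy as np

# All four combinations of (predicted, actual); hypothetical illustration data.
predicted = np.array([True, True, False, False])
actual = np.array([True, False, True, False])

# Element-wise AND: True only where both arrays are True.
both = predicted & actual

print(both)        # [ True False False False]
print(both.sum())  # 1
```

Summing the result counts the True entries, which is how the counts of true positives and the other cells are obtained.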
predict_positive & actual_positive
array([False, False, False, ..., False, True, True])
(predict_positive & actual_positive).sum()
210
We have 210 True Positives and 922 True Negatives
tp = (predict_positive & actual_positive).sum()
tn = (predict_negative & actual_negative).sum()
tp, tn
(210, 922)
We have 101 False Positives and 176 False Negatives
fp = (predict_positive & actual_negative).sum()
fn = (predict_negative & actual_positive).sum()
fp, fn
(101, 176)
confusion_matrix = np.array([
[tn, fp],
[fn, tp]
])
confusion_matrix
array([[922, 101],
[176, 210]])
What all of this is telling us is that with our current model we would be sending out 101 discount emails to people who are not at risk of churning, and we would be missing out by not sending discount emails to 176 people who were going to churn. In the first case we lose money because we gave discounts to people who were not at risk of churning; in the second case we miss out on future revenue by not attempting to retain customers who are at high risk of churning.
(confusion_matrix / confusion_matrix.sum()).round(2)
array([[0.65, 0.07],
[0.12, 0.15]])
Precision tells us the fraction of positive predictions that are correct. It takes into account only the instances predicted as positive (TP and FP - second column of the confusion matrix), as stated in the following formula:
TP / (TP + FP)
Recall measures the fraction of actual positive instances that are correctly identified. It considers all actually positive instances, whether they were predicted positive or negative (TP and FN - second row of the confusion table). The formula of this metric is presented below:
TP / (TP + FN)
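In this problem, the precision and recall values were about 68% and 54% respectively. These measures reveal errors of our model that accuracy hid because of the class imbalance: roughly a third of the customers we flag are not going to churn, and we miss almost half of the customers who do churn.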
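To see how the two formulas react to the decision threshold, here is a small sketch on made-up scores (not the Telco data). The helper `precision_recall` and the toy arrays are hypothetical names introduced only for this illustration; the helper assumes at least one positive prediction, otherwise precision would divide by zero.

```python
import numpy as np

# Hypothetical toy labels and predicted probabilities (assumptions for illustration).
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

def precision_recall(y_true, y_score, t):
    """Return (precision, recall) for decisions y_score >= t."""
    predicted_positive = y_score >= t
    tp = (predicted_positive & (y_true == 1)).sum()
    fp = (predicted_positive & (y_true == 0)).sum()
    fn = (~predicted_positive & (y_true == 1)).sum()
    return tp / (tp + fp), tp / (tp + fn)

# Evaluate a low, a middle, and a high threshold.
results = {t: precision_recall(y_true_toy, y_score_toy, t) for t in (0.2, 0.5, 0.75)}
for t, (p, r) in results.items():
    print(t, round(p, 3), round(r, 3))
```

On this toy data a low threshold catches every positive at the cost of precision, while a high threshold misses most positives.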
(tp + tn) / (tp + tn + fp + fn) # accuracy score of our model
0.8034066713981547
For precision we are only interested in the customers the model predicted as churning, that is, TP and FP.
p = tp / (fp + tp)
p
0.6752411575562701
r = tp / (tp + fn)
r
0.5440414507772021
ROC stands for Receiver Operating Characteristic. The idea was applied during the Second World War to evaluate the strength of radio detectors. This measure considers the False Positive Rate (FPR) and the True Positive Rate (TPR), which are derived from the values of the confusion matrix.
FPR is the number of false positives (FP) divided by the total number of negatives (FP and TN - the first row of the confusion matrix), and we want to minimize it. The formula of FPR is the following:
FP / (TN + FP)
On the other hand, TPR or Recall is the number of true positives (TP) divided by the total number of positives (FN and TP - second row of the confusion table), and we want to maximize this metric. The formula of this measure is presented below:
TP / (FN + TP)
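ROC curves consider Recall and FPR under all the possible thresholds. At a threshold of 0 every customer is predicted positive, so both TPR and FPR equal 1; at a threshold of 1 no customer is predicted positive (as long as no predicted probability is exactly 1), so both equal 0. The two rates take the same values at these extremes, but they have different meanings, as we explained before.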
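The behaviour at the two extreme thresholds can be checked with a minimal sketch on made-up scores. The helper `tpr_fpr` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

# Hypothetical toy labels and predicted probabilities (assumptions for illustration).
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
y_score_toy = np.array([0.2, 0.4, 0.5, 0.6, 0.7, 0.9])

def tpr_fpr(y_true, y_score, t):
    """Return (TPR, FPR) for decisions y_score >= t."""
    predicted_positive = y_score >= t
    tp = (predicted_positive & (y_true == 1)).sum()
    fp = (predicted_positive & (y_true == 0)).sum()
    return tp / (y_true == 1).sum(), fp / (y_true == 0).sum()

low = tpr_fpr(y_true_toy, y_score_toy, 0.0)   # everyone is predicted positive
high = tpr_fpr(y_true_toy, y_score_toy, 1.0)  # nobody is predicted positive

print(low, high)
```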
We need to compare the model's ROC curve against a point of reference to evaluate its performance, so the corresponding curves of random and ideal models are required. It is possible to plot TPR and FPR against the thresholds, or TPR (Recall) against FPR.
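As a sketch of where the points of the TPR-vs-FPR plot come from, the following traces the curve by sorting made-up scores from highest to lowest and accumulating positives and negatives. This is an alternative to looping over a fixed grid of thresholds, it assumes there are no tied scores, and the array names are hypothetical.

```python
import numpy as np

# Hypothetical toy labels and predicted probabilities (no tied scores assumed).
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score_toy = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

# Sort examples from the highest score to the lowest.
order = np.argsort(-y_score_toy)
sorted_true = y_true_toy[order]

# Lowering the threshold past one example at a time adds one TP or one FP.
tp_cum = np.cumsum(sorted_true)
fp_cum = np.cumsum(1 - sorted_true)

# Prepend the (0, 0) point, which corresponds to predicting nobody as positive.
tpr_points = np.concatenate([[0.0], tp_cum / tp_cum[-1]])
fpr_points = np.concatenate([[0.0], fp_cum / fp_cum[-1]])

print(fpr_points)
print(tpr_points)
```

The curve always starts at (0, 0) and ends at (1, 1); the closer it gets to the top-left corner in between, the better the ranking.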
Classes and methods:
np.repeat([x, y], [z, w]) - returns a numpy array with z copies of x followed by w copies of y
roc_curve(x, y) - sklearn.metrics function for calculating the false positive rates, true positive rates, and thresholds, given a target x dataset and a predicted y dataset
TPR is the same equation as recall
tpr = tp / (fn + tp)
tpr
0.5440414507772021
fpr = fp / (tn + fp)
fpr
0.09872922776148582
scores = []
thresholds = np.linspace(0, 1, 101)
for t in thresholds:
    actual_positive = (y_val == 1)
    actual_negative = (y_val == 0)
    predict_positive = (y_pred >= t)
    predict_negative = (y_pred < t)
    tp = (predict_positive & actual_positive).sum()
    tn = (predict_negative & actual_negative).sum()
    fp = (predict_positive & actual_negative).sum()
    fn = (predict_negative & actual_positive).sum()
    scores.append((t, tp, fp, fn, tn))
scores
[(0.0, 386, 1023, 0, 0), (0.01, 385, 914, 1, 109), (0.02, 384, 830, 2, 193), (0.03, 383, 766, 3, 257), (0.04, 381, 715, 5, 308), (0.05, 379, 683, 7, 340), (0.06, 377, 661, 9, 362), (0.07, 372, 640, 14, 383), (0.08, 371, 613, 15, 410), (0.09, 369, 580, 17, 443), (0.1, 366, 556, 20, 467), (0.11, 365, 528, 21, 495), (0.12, 365, 509, 21, 514), (0.13, 360, 477, 26, 546), (0.14, 355, 453, 31, 570), (0.15, 351, 435, 35, 588), (0.16, 347, 419, 39, 604), (0.17, 346, 401, 40, 622), (0.18, 344, 384, 42, 639), (0.19, 338, 369, 48, 654), (0.2, 333, 356, 53, 667), (0.21, 329, 341, 57, 682), (0.22, 323, 322, 63, 701), (0.23, 320, 313, 66, 710), (0.24, 316, 304, 70, 719), (0.25, 309, 291, 77, 732), (0.26, 304, 281, 82, 742), (0.27, 303, 270, 83, 753), (0.28, 295, 256, 91, 767), (0.29, 291, 244, 95, 779), (0.3, 284, 236, 102, 787), (0.31, 280, 230, 106, 793), (0.32, 278, 225, 108, 798), (0.33, 276, 221, 110, 802), (0.34, 274, 212, 112, 811), (0.35000000000000003, 272, 207, 114, 816), (0.36, 267, 201, 119, 822), (0.37, 265, 197, 121, 826), (0.38, 260, 185, 126, 838), (0.39, 253, 179, 133, 844), (0.4, 249, 166, 137, 857), (0.41000000000000003, 246, 159, 140, 864), (0.42, 243, 158, 143, 865), (0.43, 241, 150, 145, 873), (0.44, 234, 147, 152, 876), (0.45, 230, 135, 156, 888), (0.46, 224, 125, 162, 898), (0.47000000000000003, 218, 120, 168, 903), (0.48, 217, 115, 169, 908), (0.49, 213, 110, 173, 913), (0.5, 210, 101, 176, 922), (0.51, 207, 99, 179, 924), (0.52, 204, 93, 182, 930), (0.53, 196, 91, 190, 932), (0.54, 194, 86, 192, 937), (0.55, 185, 79, 201, 944), (0.56, 182, 76, 204, 947), (0.5700000000000001, 176, 68, 210, 955), (0.58, 171, 61, 215, 962), (0.59, 163, 59, 223, 964), (0.6, 151, 53, 235, 970), (0.61, 145, 49, 241, 974), (0.62, 141, 46, 245, 977), (0.63, 133, 40, 253, 983), (0.64, 125, 37, 261, 986), (0.65, 119, 34, 267, 989), (0.66, 114, 31, 272, 992), (0.67, 105, 29, 281, 994), (0.68, 94, 26, 292, 997), (0.6900000000000001, 88, 25, 298, 998), (0.7000000000000001, 76, 20, 
310, 1003), (0.71, 63, 14, 323, 1009), (0.72, 57, 11, 329, 1012), (0.73, 47, 10, 339, 1013), (0.74, 41, 8, 345, 1015), (0.75, 33, 7, 353, 1016), (0.76, 30, 6, 356, 1017), (0.77, 25, 5, 361, 1018), (0.78, 19, 3, 367, 1020), (0.79, 15, 2, 371, 1021), (0.8, 13, 2, 373, 1021), (0.81, 6, 0, 380, 1023), (0.8200000000000001, 5, 0, 381, 1023), (0.8300000000000001, 3, 0, 383, 1023), (0.84, 0, 0, 386, 1023), (0.85, 0, 0, 386, 1023), (0.86, 0, 0, 386, 1023), (0.87, 0, 0, 386, 1023), (0.88, 0, 0, 386, 1023), (0.89, 0, 0, 386, 1023), (0.9, 0, 0, 386, 1023), (0.91, 0, 0, 386, 1023), (0.92, 0, 0, 386, 1023), (0.93, 0, 0, 386, 1023), (0.9400000000000001, 0, 0, 386, 1023), (0.9500000000000001, 0, 0, 386, 1023), (0.96, 0, 0, 386, 1023), (0.97, 0, 0, 386, 1023), (0.98, 0, 0, 386, 1023), (0.99, 0, 0, 386, 1023), (1.0, 0, 0, 386, 1023)]
df_scores = pd.DataFrame(scores)
df_scores
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 0.00 | 386 | 1023 | 0 | 0 |
| 1 | 0.01 | 385 | 914 | 1 | 109 |
| 2 | 0.02 | 384 | 830 | 2 | 193 |
| 3 | 0.03 | 383 | 766 | 3 | 257 |
| 4 | 0.04 | 381 | 715 | 5 | 308 |
| ... | ... | ... | ... | ... | ... |
| 96 | 0.96 | 0 | 0 | 386 | 1023 |
| 97 | 0.97 | 0 | 0 | 386 | 1023 |
| 98 | 0.98 | 0 | 0 | 386 | 1023 |
| 99 | 0.99 | 0 | 0 | 386 | 1023 |
| 100 | 1.00 | 0 | 0 | 386 | 1023 |
101 rows × 5 columns
columns = ['threshold', 'tp', 'fp', 'fn', 'tn']
df_scores = pd.DataFrame(scores, columns = columns)
df_scores[::10] # looking at each 10th record
| | threshold | tp | fp | fn | tn |
|---|---|---|---|---|---|
| 0 | 0.0 | 386 | 1023 | 0 | 0 |
| 10 | 0.1 | 366 | 556 | 20 | 467 |
| 20 | 0.2 | 333 | 356 | 53 | 667 |
| 30 | 0.3 | 284 | 236 | 102 | 787 |
| 40 | 0.4 | 249 | 166 | 137 | 857 |
| 50 | 0.5 | 210 | 101 | 176 | 922 |
| 60 | 0.6 | 151 | 53 | 235 | 970 |
| 70 | 0.7 | 76 | 20 | 310 | 1003 |
| 80 | 0.8 | 13 | 2 | 373 | 1021 |
| 90 | 0.9 | 0 | 0 | 386 | 1023 |
| 100 | 1.0 | 0 | 0 | 386 | 1023 |
df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)
df_scores[::10]
| | threshold | tp | fp | fn | tn | tpr | fpr |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 386 | 1023 | 0 | 0 | 1.000000 | 1.000000 |
| 10 | 0.1 | 366 | 556 | 20 | 467 | 0.948187 | 0.543500 |
| 20 | 0.2 | 333 | 356 | 53 | 667 | 0.862694 | 0.347996 |
| 30 | 0.3 | 284 | 236 | 102 | 787 | 0.735751 | 0.230694 |
| 40 | 0.4 | 249 | 166 | 137 | 857 | 0.645078 | 0.162268 |
| 50 | 0.5 | 210 | 101 | 176 | 922 | 0.544041 | 0.098729 |
| 60 | 0.6 | 151 | 53 | 235 | 970 | 0.391192 | 0.051808 |
| 70 | 0.7 | 76 | 20 | 310 | 1003 | 0.196891 | 0.019550 |
| 80 | 0.8 | 13 | 2 | 373 | 1021 | 0.033679 | 0.001955 |
| 90 | 0.9 | 0 | 0 | 386 | 1023 | 0.000000 | 0.000000 |
| 100 | 1.0 | 0 | 0 | 386 | 1023 | 0.000000 | 0.000000 |
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930d5ff880>
np.random.seed(1)
y_rand = np.random.uniform(0, 1, size=len(y_val))
y_rand.round(3)
array([0.417, 0.72 , 0. , ..., 0.774, 0.334, 0.089])
((y_rand >= 0.5) == y_val).mean()
0.5017743080198722
def tpr_fpr_dataframe(y_val, y_pred):
    scores = []
    thresholds = np.linspace(0, 1, 101)
    for t in thresholds:
        actual_positive = (y_val == 1)
        actual_negative = (y_val == 0)
        predict_positive = (y_pred >= t)
        predict_negative = (y_pred < t)
        tp = (predict_positive & actual_positive).sum()
        tn = (predict_negative & actual_negative).sum()
        fp = (predict_positive & actual_negative).sum()
        fn = (predict_negative & actual_positive).sum()
        scores.append((t, tp, fp, fn, tn))
    columns = ['threshold', 'tp', 'fp', 'fn', 'tn']
    df_scores = pd.DataFrame(scores, columns = columns)
    df_scores['tpr'] = df_scores.tp / (df_scores.tp + df_scores.fn)
    df_scores['fpr'] = df_scores.fp / (df_scores.fp + df_scores.tn)
    return df_scores
df_rand = tpr_fpr_dataframe(y_val, y_rand)
df_rand[::10]
plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR')
plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930d5e7640>
num_neg = (y_val == 0).sum()
num_pos = (y_val == 1).sum()
num_neg, num_pos
(1023, 386)
The np.repeat function creates an array with num_neg zeros followed by num_pos ones. All negatives therefore come before all positives, which is the ordering an ideal model would produce.
y_ideal = np.repeat([0, 1], [num_neg, num_pos])
y_ideal
array([0, 0, 0, ..., 1, 1, 1])
y_ideal_pred = np.linspace(0, 1, len(y_val))
1 - y_val.mean()
0.7260468417317246
((y_ideal_pred >= 0.726) == y_ideal).mean()
1.0
df_ideal = tpr_fpr_dataframe(y_ideal, y_ideal_pred)
plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.legend()
<matplotlib.legend.Legend at 0x2930bcdddb0>
plt.plot(df_scores.threshold, df_scores['tpr'], label='TPR_scores')
plt.plot(df_scores.threshold, df_scores['fpr'], label='FPR_scores')
plt.plot(df_ideal.threshold, df_ideal['tpr'], label='TPR_ideal')
plt.plot(df_ideal.threshold, df_ideal['fpr'], label='FPR_ideal')
#plt.plot(df_rand.threshold, df_rand['tpr'], label='TPR_rand')
#plt.plot(df_rand.threshold, df_rand['fpr'], label='FPR_rand')
plt.legend()
<matplotlib.legend.Legend at 0x293090417b0>
plt.figure(figsize=(5, 5))
plt.plot(df_scores.fpr, df_scores.tpr, label='Model')
#plt.plot(df_rand.fpr, df_rand.tpr, label='random')
#plt.plot(df_ideal.fpr, df_ideal.tpr, label='ideal')
plt.plot([0, 1], [0, 1], label='Random', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
<matplotlib.legend.Legend at 0x2930bf91240>
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
plt.figure(figsize=(5, 5))
plt.plot(fpr, tpr, label='Model')
plt.plot([0, 1], [0, 1], label='Random', linestyle='--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend()
<matplotlib.legend.Legend at 0x2930e905ae0>
The area under the ROC curve (AUROC, or AUC) tells us how good our model is with a single value. The AUROC of a random model is 0.5, while that of an ideal model is 1.
In other words, AUC can be interpreted as the probability that a randomly selected positive example has a greater score than a randomly selected negative example.
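This interpretation can be checked exactly on a small example by comparing every positive score with every negative score instead of sampling pairs. The sketch below uses made-up scores; `pos_toy` and `neg_toy` are hypothetical names, and counting ties as half a win is a common convention assumed here.

```python
import numpy as np

# Hypothetical predicted scores for positive and negative examples.
pos_toy = np.array([0.9, 0.6, 0.4])
neg_toy = np.array([0.5, 0.3, 0.2, 0.1])

# Broadcasting builds a (3, 4) matrix of all positive-minus-negative differences.
diff = pos_toy[:, None] - neg_toy[None, :]

# A positive "wins" a pair when its score is higher; ties count as half a win.
wins = (diff > 0).sum()
ties = (diff == 0).sum()
pairwise_auc = (wins + 0.5 * ties) / diff.size

print(pairwise_auc)
```

Here 11 of the 12 pairs are ranked correctly. Comparing all pairs is only practical for small arrays, because the matrix grows with the product of the two sizes.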
Classes and methods:
auc(x, y) - sklearn.metrics function for calculating the area under the curve defined by the x and y datasets. For ROC curves x would be the false positive rate, and y the true positive rate.
roc_auc_score(x, y) - sklearn.metrics function for calculating the area under the ROC curve, given a target x dataset and a predicted y dataset.
from sklearn.metrics import auc
auc(fpr, tpr)
0.8438302463039216
auc(df_scores.fpr, df_scores.tpr)
0.8438365773732646
auc(df_ideal.fpr, df_ideal.tpr)
0.9999430203759136
fpr, tpr, thresholds = roc_curve(y_val, y_pred)
auc(fpr, tpr)
0.8438302463039216
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)
0.8438302463039216
neg = y_pred[y_val == 0]
pos = y_pred[y_val == 1]
neg
array([0.00899416, 0.20483208, 0.21257239, ..., 0.10764076, 0.31400436,
0.13641188])
import random
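Below is an example of what roc_auc_score measures: we repeatedly draw a random positive and a random negative example and count how often the positive one gets the higher score.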
n = 1000000
success = 0
for i in range(n):
    pos_ind = random.randint(0, len(pos) - 1)
    neg_ind = random.randint(0, len(neg) - 1)
    if pos[pos_ind] > neg[neg_ind]:
        success = success + 1
success / n
0.843749
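We can also do this with NumPy. Note that random.seed(1) below seeds Python's random module, not NumPy's generator, so it does not make the NumPy draws reproducible.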
n = 1000000
random.seed(1)
pos_ind = np.random.randint(0, len(pos), size=n)
neg_ind = np.random.randint(0, len(neg), size=n)
(pos[pos_ind] > neg[neg_ind]).mean()
0.843908
Extra resources In the lesson we talked about iterators and generators in Python. You can read more about them here:
Notes
Cross-validation refers to evaluating the same model on different subsets of a dataset and computing the average score and the spread of the scores across those subsets. This method is applied in the parameter tuning step, which is the process of selecting the best parameter.
In this algorithm, the full training dataset is divided into k partitions (folds). We train the model on k-1 partitions and evaluate it on the remaining one. We repeat this until every one of the k folds has served as the validation set once, and then calculate the average evaluation metric over all the folds.
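The splitting step described above can be sketched by hand with numpy. This is only an illustration of the idea, not scikit-learn's implementation; the generator `kfold_indices` and its parameters are hypothetical names.

```python
import numpy as np

def kfold_indices(n_rows, n_splits, seed=1):
    """Yield (train_idx, val_idx) pairs so that each row is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)           # shuffle the row positions
    folds = np.array_split(shuffled, n_splits)   # cut them into k roughly equal folds
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_rows=10, n_splits=5))
for train_idx, val_idx in splits:
    print(sorted(train_idx.tolist()), sorted(val_idx.tolist()))
```

Each of the 10 rows lands in exactly one validation fold, and the model would be trained 5 times, once per split.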
In general, if the dataset is large, we should use the hold-out validation dataset strategy. On the other hand, if the dataset is small or we want to know the standard deviation of the model's score across different folds, we can use the cross-validation approach.
Libraries, classes and methods:
KFold(k, s, x) - sklearn.model_selection class for splitting a dataset into k cross-validation folds, with s as the boolean shuffle option and x as the random state
KFold.split(x) - KFold method for splitting the x dataset according to the settings established when the KFold object was constructed
for i in tqdm() - tqdm is a library for showing the progress of each i iteration in a for loop
def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    return dv, model
dv, model = train(df_train, y_train, C=0.001)
def predict(df, dv, model):
    dicts = df[categorical + numerical].to_dict(orient='records')
    X = dv.transform(dicts)
    y_pred = model.predict_proba(X)[:, 1]
    return y_pred
y_pred = predict(df_val, dv, model)
from sklearn.model_selection import KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
train_idx, val_idx = next(kfold.split(df_full_train))
tqdm is a package that shows a progress bar while a loop iterates.
import sys
!conda install --yes --prefix {sys.prefix} tqdm
from tqdm.auto import tqdm
n_splits = 5
for C in tqdm([0.001, 0.01, 0.1, 0.5, 1, 5, 10]):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    scores = []
    for train_idx, val_idx in kfold.split(df_full_train):
        df_train = df_full_train.iloc[train_idx]
        df_val = df_full_train.iloc[val_idx]
        y_train = df_train.churn.values
        y_val = df_val.churn.values
        dv, model = train(df_train, y_train, C=C)
        y_pred = predict(df_val, dv, model)
        auc = roc_auc_score(y_val, y_pred)
        scores.append(auc)
    print('C=%s %.3f +- %.3f' % (C, np.mean(scores), np.std(scores)))
C=0.001 0.825 +- 0.009
C=0.01 0.840 +- 0.009
C=0.1 0.841 +- 0.008
C=0.5 0.841 +- 0.007
C=1 0.841 +- 0.008
C=5 0.840 +- 0.008
C=10 0.841 +- 0.007
The scores above are all very similar; only C=0.001 is noticeably lower. Since the rest are very close and C=1.0 is the default, we will just use that. With the parameter chosen, we can train the model on the full training dataset and evaluate it on the test dataset.
dv, model = train(df_full_train, df_full_train.churn.values, C=1.0)
y_pred = predict(df_test, dv, model)
auc = roc_auc_score(y_test, y_pred)
auc
0.8572386167896259
General definitions:
In brief, this week was about different metrics to evaluate a binary classifier. These measures included accuracy, the confusion table, precision, recall, ROC curves (TPR, FPR, random model, and ideal model), and AUROC. We also talked about cross-validation as a different way to estimate the performance of the model and to tune its parameters.